Thanks a lot. It's really nice to see so many mathematicians stepping into the area of trying to understand the mathematical foundations of deep neural networks. I've been working in this area since 2018, so it's been seven years, and it's really nice to see how this field evolves and brings new excitement. First of all, I want to show you that we are now using models of ever-increasing size. You can see GPT-2, GPT-3, and GPT-4; for GPT-4 we don't really know the size, it's just a guess. There is also DeepSeek: DeepSeek-V3 has about 600 billion parameters, and the leak is that the new one is even larger.
So why are we using such very large systems to learn? We can observe the same in the evolution of mammals: from the mouse brain to the human brain, the size increased a lot. And that leads to a question. With a very, very large model with so many parameters, do we suffer from overfitting? Since we have many more degrees of freedom than a mouse, should we worry about that? Traditional wisdom tells us that if we have some data, maybe not enough, certainly not infinitely many samples, we should probably start with a small model, meaning a model with fewer parameters. If we blindly use a large model, we can easily suffer from the problem of overfitting: even though we obtain a very good training error, that by itself does not give us generalization capability.
This traditional wisdom comes with lots of justification, for example from numerical analysis or from statistical learning theory; there are many different justifications for the claim that complex models easily overfit. There is even a philosophical justification, Occam's razor, telling you that you shouldn't consider a very large model. However, as we can see from the success of, for example, large neural network models, the scaling law tells you that even if you have very limited data, using large models still brings benefits. That is really weird, and it is therefore sometimes called an apparent paradox, although it goes by other names as well. It is the thing we must explain in order to have a mathematical understanding. So when did we first realize this problem?
It goes back at least to Leo Breiman, the very famous statistician. In 1995 he posed four problems that are still not well solved even now. In particular, in the first one he asks: if we have a very heavily parameterized neural network, how do we understand its non-overfitting behavior? We did make some progress, and some of you probably know the Neural Tangent Kernel theory. However, the Neural Tangent Kernel theory actually tells us that a neural network sometimes resembles a kernel method; in which sense the neural network model is superior to kernel methods, we don't really understand. Therefore, the whole phenomenon I'm going to tell you about concerns this kind of nonlinear behavior.
I'm trying to understand how these overparameterized neural networks control the complexity of the output function during the nonlinear training process. You can see that if we let the network increase its complexity arbitrarily fast, then overfitting is unavoidable; it seems that there is some kind of nonlinear behavior that helps generalization. Okay, now I will tell you what the condensation phenomenon is.
First of all, let's go to a very intuitive illustration. We have a one-hidden-layer neural network with five neurons, and initially all the weights are initialized randomly from a Gaussian distribution, so they are all different. However, during training you can observe the following behavior, where some of the input weights become equal to one another: for example, neurons one, two, and three become almost identical, and so do neurons four and five. If that happens, we say the network condenses. When the neurons are in a condensed state, you can see that we can combine different neurons into fewer neurons, so the network is equivalent to a smaller one. If that really happens during training, kind of automatically, then it must be a mechanism to control the complexity during training.
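To make the merging step concrete, here is a minimal sketch of my own (not from the talk) checking that two tanh neurons sharing the same input weight and bias act exactly like a single neuron whose output weight is the sum of theirs:

```python
import numpy as np

# One-hidden-layer network: f(x) = sum_j a_j * tanh(w_j * x + b_j).
# If two neurons share the same (w, b), they can be merged into one
# neuron with output weight a1 + a2, i.e., a smaller network.

rng = np.random.default_rng(0)
x = rng.normal(size=100)            # 1-D inputs

w, b = 0.7, -0.2                    # shared input weight and bias of the condensed pair
a1, a2 = 1.3, -0.4                  # output weights of the two condensed neurons

two_neurons = a1 * np.tanh(w * x + b) + a2 * np.tanh(w * x + b)
one_neuron = (a1 + a2) * np.tanh(w * x + b)

print(np.max(np.abs(two_neurons - one_neuron)))  # 0.0: the two networks coincide
```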
Okay, now let's look at the real phenomenon. Here we have a one-hidden-layer neural network with hundreds of neurons, and the fitting is just a one-dimensional fitting problem; the blue dots are the training points. Each red dot represents a neuron: the x-axis is its input weight w_j and the y-axis is its bias b_j, so each neuron's orientation in the (w_j, b_j) plane represents its feature.
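As a rough way to reproduce this kind of picture, here is a hedged sketch, assuming PyTorch, with all hyperparameters being my own choices rather than the talk's: it trains a small one-hidden-layer tanh network on a 1-D regression task from a small Gaussian initialization and prints the (w_j, b_j) pairs, which one can scatter-plot to check whether they cluster:

```python
import torch

torch.manual_seed(0)

# 1-D training data (the analogue of the blue dots in the talk's figure)
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = torch.sin(3 * x).squeeze()

m = 200                                            # hundreds of hidden neurons
w = (0.1 * torch.randn(m, 1)).requires_grad_()     # input weights w_j (small init)
b = (0.1 * torch.randn(m)).requires_grad_()        # biases b_j
a = (0.1 * torch.randn(m)).requires_grad_()        # output weights a_j

def f(x):
    # f(x) = sum_j a_j * tanh(w_j * x + b_j)
    return torch.tanh(x @ w.T + b) @ a

opt = torch.optim.Adam([w, b, a], lr=1e-3)
for step in range(20000):
    opt.zero_grad()
    loss = ((f(x) - y) ** 2).mean()
    loss.backward()
    opt.step()

# Each (w_j, b_j) pair is one red dot; condensation shows up as a few clusters.
for wj, bj in zip(w.detach().squeeze().tolist(), b.detach().tolist()):
    print(f"w_j = {wj:+.3f}, b_j = {bj:+.3f}")
```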